engram Retrieval Evaluation
Benchmarking persistent memory retrieval across knowledge categories
Evaluation summary
386 domain questions across 6 knowledge categories. Two metrics: Token F1 (exact lexical overlap) and LLM Judge (binary semantic correctness via Gemini 2.5 Flash). Ceiling = full-context injection upper bound (no retrieval step).
Dataset & metrics
The evaluation dataset contains 386 question-answer pairs hand-authored across six knowledge categories that mirror engram’s storage taxonomy:
| Category | Questions | What is tested |
|---|---|---|
| decisions | 72 | Architectural choices, rationale captured in memory |
| solutions | 83 | Bug fixes and workarounds retrieved accurately |
| patterns | 67 | Code conventions recalled with correct detail |
| bugs | 51 | Known issues surfaced without hallucination |
| insights | 62 | Non-obvious findings retrieved in context |
| procedures | 51 | Step-by-step workflows recalled in order |
Token F1 treats the answer as a bag of tokens and measures precision/recall against the reference. It rewards exact term overlap and is deterministic. LLM Judge uses Gemini 2.5 Flash to score each answer as 0 or 1 (semantically correct or not); the reported score is the mean across questions, multiplied by 100. Because scoring is binary, it carries roughly 2–3 pts of natural run-to-run variance, so the CI regression gate is set at 2.5 pts to avoid false alarms.
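As an illustration, the Token F1 metric can be sketched in a few lines. This is a minimal sketch assuming simple lowercased whitespace tokenisation; the harness's actual tokenizer may differ:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1: precision/recall over multiset token overlap."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A paraphrased but correct answer scores low here because it shares few surface tokens with the reference, which is exactly the weakness discussed for the insights category later on.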
The ceiling answers every question with the full knowledge base injected as context (no retrieval). It represents the upper bound for a perfect retriever with unlimited context.
Results across versions
The shaded bands around each ceiling show ±1 pt, roughly the natural measurement noise. Versions v6 and v6b cross the judge ceiling band — this is expected and explained in the methodology section.
What changed at each version
v1 → v2 −0.1
Output formatting cleanup and smoke-test tuning. No retrieval change — confirms the baseline retrieval quality was already the bottleneck, not output formatting.
v2 → v3 +14.9
HyDE (Hypothetical Document Embedding): instead of embedding the raw question, a hypothetical answer is generated and embedded. This bridges the vocabulary gap between a terse query and a verbose knowledge entry. This release also introduced multi-category extraction, letting a single conversation contribute entries to decisions, solutions, patterns, bugs, insights, and procedures rather than to a single bucket.
v3 → v3-full +2.9
Scaled from a smoke-test subset (52 questions) to the full 385-question dataset. The gain reflects better category coverage; procedures in particular had been underweighted in the smoke set.
v3-full → v4 +0.6
Post-merge housekeeping: clippy fixes, rustfmt, no retrieval changes. Small gain from cleaner extraction output.
v4 → v5 −0.6
Noise run — no code changes. Demonstrates the ~2–3 pt natural judge variance across identical runs. The F1 drop is within measurement error for the dataset size.
v5 → v6 +3.3
Three simultaneous improvements: (1) BM25 hybrid retrieval with RRF rescues exact-match queries that dense embeddings underweight; (2) procedures QA dataset regenerated — the original 50 questions used vague ordinal references (“the first procedure”) that made retrieval impossible; regenerated with named references; (3) ADD artifact root cause fixed — the extraction LLM was occasionally prefixing entries with resolver keywords (ADD/NOOP), polluting knowledge files.
v6 → v6b +0.6
Independent confirmation run. Confirms v6 results are reproducible within expected variance.
Category deep-dive
What the categories tell us
Procedures shows the most dramatic improvement (+27.8 F1, baseline→v6b). The gain came almost entirely from fixing the QA dataset: the original questions referenced procedures by ordinal position (“the third procedure”) rather than by name, making retrieval a lottery. Once regenerated with named references, the retriever could anchor on distinctive phrases.
Decisions and solutions are closest to ceiling, suggesting the retrieval quality for well-structured factual entries is largely solved at the current corpus size.
Bugs and patterns have the largest remaining F1 gaps (−11 and −10 below ceiling). Bug reports tend to have diverse phrasing with technical identifiers; patterns are often abstract descriptions where synonym variation defeats token F1 even when the semantic answer is correct.
Insights scores highest on the judge metric (25.8) despite a moderate F1 (50.3) — the LLM judge finds the answers semantically correct even when the exact wording differs. This suggests F1 undervalues retrieval quality for this category.
Retrieval methodology
Dense embeddings + HyDE
Each knowledge entry is embedded using a sentence embedding model. At query time, instead of embedding the raw question, the system generates a hypothetical answer — a plausible but fabricated response — and embeds that. The intuition: a hypothetical answer lives in the same vector space as real answers, bridging the vocabulary gap between a short question and a verbose knowledge entry.
```
Q: "what did we decide about async runtimes?"
Hypothetical: "We decided to use tokio for all async operations because..."
Embed(hypothetical) → cosine search → top-k entries → answer
```
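The pipeline above can be sketched as follows. The `embed` function here is a deliberately toy bag-of-words stand-in for the real sentence-embedding model, and `generate_hypothetical` stands in for the LLM call; both names are illustrative, not engram's actual API:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for a sentence-embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, entries: list[str], generate_hypothetical, k: int = 3):
    """HyDE: embed a hypothetical *answer* to the query, not the query itself."""
    hyp = generate_hypothetical(query)  # in practice, an LLM generation call
    qvec = embed(hyp)
    ranked = sorted(entries, key=lambda e: cosine(qvec, embed(e)), reverse=True)
    return ranked[:k]
```

The key point the sketch preserves: the hypothetical answer shares vocabulary with stored entries ("we decided to use tokio…") even when the question itself does not.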
BM25 hybrid with Reciprocal Rank Fusion
Dense retrieval excels at semantic similarity but misses exact technical terms — version numbers, function names, error codes. BM25 is purely lexical and catches these. The two ranked lists are fused with RRF:
\[\text{score}(d) = \frac{1}{60 + r_{\text{dense}}(d)} + \frac{1}{60 + r_{\text{BM25}}(d)}\]
The constant 60 dampens the influence of exact rank position, making the fusion robust to small rank differences at the top. BM25 uses \(k_1 = 1.5\), \(b = 0.75\) (standard defaults).
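The fusion step can be sketched directly from the formula. The constant 60 and the two-list structure come from the text above; the function name and list-of-ids representation are illustrative:

```python
def rrf_fuse(dense_ranking: list[str], bm25_ranking: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in (dense_ranking, bm25_ranking):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note how a document ranked second in both lists (1/62 + 1/62) outscores one ranked first in only a single list (1/61), which is the behaviour that rescues exact-match queries dense retrieval underweights.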
Impact: the not-found rate dropped from 18.4% to 13.2%, meaning roughly 20 additional questions now return an answer where dense-only retrieval returned nothing.
Full results table
| Version | Token F1 | LLM Judge | % F1 ceil | % J ceil | Not-found | N |
|---|---|---|---|---|---|---|
| Baseline | 34.7 | 9.6 | 51% | 69% | — | 52 |
| v1 | 38.9 | 8.8 | 57% | 63% | — | 385 |
| v2 | 38.8 | 9.4 | 57% | 67% | — | 385 |
| v3 | 53.7 | 12.2 | 79% | 87% | — | 90 |
| v3-full | 56.6 | 15.3 | 83% | 109% | — | 385 |
| v4 | 57.2 | 14.5 | 84% | 104% | 18.4% | 385 |
| v5 | 56.6 | 13.0 | 83% | 93% | 18.2% | 385 |
| v6 | 59.9 | 16.1 | 88% | 115% | 13.2% | 386 |
| v6b | 60.5 | 16.3 | 89% | 116% | 13.2% | 386 |
| Ceiling | 67.9 | 14.0 | — | — | — | 385 |
Methodology notes
Why judge can exceed ceiling. The ceiling is computed by injecting the entire knowledge base as context. With thousands of tokens of mixed content, the model has to locate and extract the relevant fragment. RAG retrieves a small, focused set of entries — the model sees less noise and tends to give cleaner, more confident answers that the judge scores higher.
Judge variance. Binary scoring (0/1 per question, ×100) means each question is worth 100/N ≈ 0.26 pts. A swing of ±10 questions changes the judge score by ±2.6 pts. Two identical runs can therefore differ by 2–3 pts with no code changes. The CI regression gate is set at 2.5 pts to account for this.
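The gate arithmetic can be sanity-checked directly, using N = 386 from the full results table:

```python
N = 386                        # questions in the current dataset
per_question = 100 / N         # judge points gained/lost per flipped question
swing_10 = 10 * per_question   # a plausible run-to-run swing of 10 flips

# per_question ≈ 0.259 pts, swing_10 ≈ 2.59 pts: a ±10-question swing
# lands right around the 2.5 pt CI gate, which is why the gate is set there.
```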
Token F1 limitations. F1 rewards token overlap but penalises paraphrases. A correct answer that uses synonyms scores lower than a partially wrong answer that reproduces the reference verbatim. This is why judge and F1 sometimes diverge sharply (e.g. insights: F1=50, judge=26 → the answers are semantically right but worded differently).
What to improve next
- Bugs and patterns have the largest gaps (~11 pts each). Both suffer from high linguistic variation. Cross-encoder re-ranking — running a lightweight classifier over retrieved candidates — would help here by scoring semantic relevance rather than surface form.
- Query expansion (generating multiple query variants and merging results) could help for short, ambiguous questions in the bugs category.
- Longer retrieval context via `engram inject --lines 360` lets the answering model see more candidates before committing to a response, at the cost of increased latency.